Milestone 1
05 Oct 2023
Question 1
The play-by-play data was downloaded from the public NHL APIs for Regular season and Playoffs. The following NHL endpoint was used to download the data:
https://statsapi.web.nhl.com/api/v1/game/[GAME_ID]/feed/live/
The GAME_ID path parameter was constructed for every Regular season and Playoff game in the seasons from 2016-17 to 2020-21, following the ID format described in the unofficial documentation of the NHL APIs.

The IDs maintain a specific structure which is as follows:
- All the IDs consist of 10 digits.
- The first 4 digits identify the season in which the game was played. For example, all games played in the 2018-19 season start with 2018.
- The next 2 digits identify the type of game, i.e. whether the game was played in Preseason, Regular season, Playoffs or All Star.
- The last 4 digits identify the specific game within the season and game type given by the preceding digits.
The following Python function was used to generate the game IDs:
def get_game_id(self, season: str, game_type: str, game_number: str):
    return f'{season}{game_type}{str(game_number).zfill(4)}'
The game_number parameter in the above function was constructed a bit
differently for Regular season and Playoff games.
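The ID structure described above can be checked by splitting an ID back into its parts. The following is a minimal sketch, not part of the original scraper; the helper name parse_game_id is hypothetical and is introduced only for illustration.

```python
def parse_game_id(game_id: str) -> dict:
    """Split a 10-digit game ID into season, game type and game number.

    Hypothetical helper for illustration; it is not part of the scraper.
    """
    if len(game_id) != 10 or not game_id.isdigit():
        raise ValueError(f'expected a 10-digit game ID, got {game_id!r}')
    return {
        'season': game_id[:4],       # e.g. '2018' for the 2018-19 season
        'game_type': game_id[4:6],   # e.g. '02' for Regular season
        'game_number': game_id[6:],  # last 4 digits, zero-padded
    }


parts = parse_game_id('2018020143')
print(parts)
```

Rebuilding the ID from these parts with the same f-string as get_game_id gives back the original 10 digits, which is a quick sanity check on the format.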
Regular Season
The ID was fairly straightforward for Regular season games. Here’s how it was constructed:
- Get the season in which the game was played (e.g. 2016, 2017, 2018, 2019, 2020).
- Get the corresponding number for the type of game (‘02’ in this case).
- Get the game number which can range from 1 to the total number of games in a given season.
For example, the ID 2018020143 denotes game number 143 of the 2018-19 Regular season (Vancouver Canucks vs Arizona Coyotes).
The following Python snippet generates the game ID and obtains the corresponding data for Regular season games:
# Loop in a single hockey season e.g. 2016 (2016-17 season), 2020 (2020-21 season)
for season, games in seasons_to_game_volume_map.items():
    # Loop inside a particular game type i.e. regular or playoffs
    for game_type in game_types:
        # Check the game type. If it is regular then the last 4 digits
        # should be the game number
        if game_type == Gametype.REGULAR.name:
            for game_number in range(1, games + 1):
                game_id = self.get_game_id(season, Gametype.REGULAR.value, str(game_number))
                self.scrape_data(game_id, Gametype.REGULAR.name, loc)
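To see which IDs the Regular season loop produces without making any API calls, the same construction can be written as a standalone function. This is a sketch under assumptions: the function name regular_season_ids and the constant REGULAR are hypothetical, and the value '02' is taken from the description above.

```python
REGULAR = '02'  # game-type code for Regular season, as described above


def regular_season_ids(season: str, total_games: int) -> list:
    """Return the candidate Regular season game IDs for one season.

    Hypothetical helper for illustration; it only builds strings and
    performs no downloads.
    """
    return [
        f'{season}{REGULAR}{str(game_number).zfill(4)}'
        for game_number in range(1, total_games + 1)
    ]


ids = regular_season_ids('2018', 1271)
print(ids[0], ids[142], ids[-1])
```

Index 142 corresponds to game number 143, so the list contains the example ID 2018020143 discussed above.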
Playoffs
The Playoffs consist of 4 rounds: the 1st round has 8 matchups, the 2nd round has 4, the 3rd round has 2 and the final round has 1.
Each matchup can have up to 7 games, of which games 5, 6 and 7 are not necessarily played.
Here’s how the game_number parameter was constructed for Playoff games:
- First 2 digits specify the round (i.e. ‘01’, ‘02’, ‘03’, ‘04’)
- The 3rd digit is the matchup number. It can range from 1 to 8 for round 1, 1 to 4 for round 2, 1 to 2 for round 3 and 1 for round 4.
- The 4th digit is the game number. For each matchup, it can range from 1 to 7.
For example, the game ID 2017030314 denotes the 4th game of the 1st matchup in
the 3rd round of 2017-18 Playoffs (Tampa Bay Lightning vs Washington Capitals).
Note: The first 6 digits of the Playoffs game ID follow the same pattern as the Regular season games.
The only difference is that the number for type of game is ‘03’.
The following Python snippet generates the game ID and obtains the corresponding data for Playoff games:
else:
    total_match_ups = 8
    round_num = 1
    # Continually divide total_match_ups as after each round
    # half of the teams are eliminated
    while total_match_ups != 0:
        for match_up in range(1, total_match_ups + 1):
            for game_number in range(1, 8):
                game_id = self.get_game_id(season, Gametype.PLAYOFFS.value,
                                           f'{str(round_num).zfill(2)}{match_up}{game_number}')
                self.scrape_data(game_id, Gametype.PLAYOFFS.name, loc)
        total_match_ups = total_match_ups // 2
        round_num += 1
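The Playoff loop can likewise be run in isolation to list every candidate ID it would request. The sketch below is illustrative only: the function name playoff_ids and the constant PLAYOFFS are hypothetical, and the value '03' comes from the note above. With 8 + 4 + 2 + 1 = 15 matchups and 7 possible games each, it yields 105 candidate IDs per season, some of which belong to games that were never played.

```python
PLAYOFFS = '03'  # game-type code for Playoffs, as described above


def playoff_ids(season: str) -> list:
    """Return every candidate Playoff game ID for one season.

    Hypothetical helper for illustration; it only builds strings and
    performs no downloads.
    """
    ids = []
    total_match_ups = 8
    round_num = 1
    # Halve the number of matchups after each round: 8, 4, 2, 1
    while total_match_ups >= 1:
        for match_up in range(1, total_match_ups + 1):
            for game_number in range(1, 8):
                ids.append(f'{season}{PLAYOFFS}{round_num:02d}{match_up}{game_number}')
        total_match_ups //= 2
        round_num += 1
    return ids


ids = playoff_ids('2017')
print(len(ids), ids[0], ids[-1])
```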
The class scrape_nhl_data downloads all the data. Its code is as follows:
import json
import os
import time

import requests as req

# Gametype is an Enum defined elsewhere in make_dataset (not shown here);
# REGULAR.value is '02' and PLAYOFFS.value is '03'.


class scrape_nhl_data:
    def get_game_id(self, season: str, game_type: str, game_number: str):
        return f'{season}{game_type}{str(game_number).zfill(4)}'

    def write_data(self, loc: str, game_type: str, content: dict):
        # For non-regular-season games, skip feeds that have no 'endDateTime'
        if game_type != Gametype.REGULAR.name and 'endDateTime' not in content['gameData']['datetime']:
            return
        with open(f'{loc}.json', 'w+', encoding='utf-8') as f:
            json.dump(content, f, ensure_ascii=False, indent=4)

    def scrape_data(self, game_id, game_type, path):
        endpoint = f'https://statsapi.web.nhl.com/api/v1/game/{game_id}/feed/live/'
        try:
            # Pause between requests to avoid overloading the API
            time.sleep(0.5)
            res = req.get(endpoint)
            res.raise_for_status()
            self.write_data(f'{path}/{game_id}', game_type, res.json())
        except req.exceptions.HTTPError as err:
            print(f'API failed for {game_id} with status code {err.response.status_code}')
        except Exception as e:
            print(f'{game_type} trace: {endpoint} {game_id}')
            print(e)

    def get_play_by_play_data(self,
                              path: str,
                              seasons_to_game_volume_map: dict,
                              game_types: list
                              ):
        """Created folders and individual json files for
        Arguments:
            path (str): Location where the files should be created.
                Ideally it should be the 'data' folder of our repository.
                Note: Do not precede the path with a '/'. If the data
                needs to be saved in the same directory as this script then
                pass an empty string ''.
            seasons_to_game_volume_map (dict of str: int): Map of seasons
                for which the data is required and the corresponding number of
                games in that season. For e.g. it will have the key as '2016'
                for the 2016-17 season.
            game_types (list of str): List of game types for which
                data needs to be retrieved.
        Return:
            Folder containing data for each hockey season. These folders in
            turn contain play-by-play data for regular and playoff games.
        """
        # Loop in a single hockey season e.g. 2016 (2016-17 season), 2020 (2020-21 season)
        for season, games in seasons_to_game_volume_map.items():
            # Loop inside a particular game type i.e. regular or playoffs
            for game_type in game_types:
                if len(path.strip()) == 0:
                    loc = f'{season}/{game_type}'
                else:
                    loc = f'{path}/{season}/{game_type}'
                if not os.path.exists(loc):
                    os.makedirs(loc)
                # Check the game type. If it is regular then the last 4 digits
                # should be the game number
                if game_type == Gametype.REGULAR.name:
                    for game_number in range(1, games + 1):
                        game_id = self.get_game_id(season, Gametype.REGULAR.value, str(game_number))
                        self.scrape_data(game_id, Gametype.REGULAR.name, loc)
                # Otherwise, the game type is playoffs and the last 4 digits
                # should be composed as follows:
                # first 2 digits -> round number (can be 01, 02, 03, 04)
                # third digit -> match up (can be up to 8, 4, 2, 1 for
                # the above mentioned round numbers)
                # fourth digit -> game number (can be from 1 to 7)
                else:
                    total_match_ups = 8
                    round_num = 1
                    # Continually divide total_match_ups as after each round
                    # half of the teams are eliminated
                    while total_match_ups != 0:
                        for match_up in range(1, total_match_ups + 1):
                            for game_number in range(1, 8):
                                game_id = self.get_game_id(season, Gametype.PLAYOFFS.value,
                                                           f'{str(round_num).zfill(2)}{match_up}{game_number}')
                                self.scrape_data(game_id, Gametype.PLAYOFFS.name, loc)
                        total_match_ups = total_match_ups // 2
                        round_num += 1
The function get_play_by_play_data needs to be called with the download path, a map of seasons to their number of games, and the game types for which the data is to be downloaded.
Here is an example of how this function can be called:
from make_dataset import Gametype, scrape_nhl_data

scraper = scrape_nhl_data()
season_data = {'2016': 1230, '2017': 1271, '2018': 1271, '2019': 1271, '2020': 868}
game_types = [Gametype.REGULAR.name, Gametype.PLAYOFFS.name]
# The function writes files to disk and does not return a value
scraper.get_play_by_play_data(path='', seasons_to_game_volume_map=season_data, game_types=game_types)
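Before starting a full download, it can help to estimate how many requests the call above will make. The following is a rough sketch based only on the numbers already given: the season map from the example, 15 matchups of up to 7 games per Playoffs, and the 0.5 second pause in scrape_data. The constant names are hypothetical, and the resulting duration is a lower bound because it ignores network and disk time.

```python
season_data = {'2016': 1230, '2017': 1271, '2018': 1271, '2019': 1271, '2020': 868}

# 8 + 4 + 2 + 1 matchups, each with up to 7 candidate games
PLAYOFF_CANDIDATES_PER_SEASON = (8 + 4 + 2 + 1) * 7
SLEEP_SECONDS = 0.5  # pause before each request in scrape_data

regular_requests = sum(season_data.values())
playoff_requests = PLAYOFF_CANDIDATES_PER_SEASON * len(season_data)
total_requests = regular_requests + playoff_requests

# Time spent sleeping alone, in minutes (network time not included)
min_minutes = total_requests * SLEEP_SECONDS / 60

print(f'Regular season requests: {regular_requests}')
print(f'Playoff requests: {playoff_requests}')
print(f'Total requests: {total_requests}')
print(f'Minimum duration: {min_minutes:.1f} minutes')
```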
Question 2
…content…